Information data retrieval, where the data is organized in terms, documents and document corpora

ABSTRACT

The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431-438), documents and document corpora, where each document contains at least one term (431-438) and each document corpus contains at least one document. Based on a concept vector (420-424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431-438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420-424). On the basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431-438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, such as the fact that a first term (431) is related to a second term (436).

THE BACKGROUND OF THE INVENTION AND PRIOR ART

The present invention relates generally to solutions for information retrieval. More particularly, the invention relates to a method of processing digitized textual information.

In this specification, information retrieval is understood as the art of retrieving document-related data that is relevant to an inquiry from a user. Conventionally, information retrieval systems have been built on the idea that the user actively searches for data by specifying queries (or search phrases) based on keywords (or search terms). Over the past decade, and with the advent of the Internet, the research pertaining to information retrieval has grown well past its initial goals of finding methods for efficient indexing and searching.

Traditional information retrieval research has been focused on search and retrieval methods based on word indexing and term vector representations. For instance, a vector similarity approach may be used to find relationships and similarities among documents by creating a weighted list of the words (or terms) included in a document. Systems operating according to this principle can be regarded as “word-comparison” apparatuses, where documents and queries are compared based on the mutual occurrence of words. Nevertheless, if two documents describe the same subject matter but with different words, the method is unable to find a relation between the documents.

To address this problem, and to improve the information retrieval systems, research is currently being conducted with the aim of generating conceptual representations of documents. The conceptual representation involves creating relatively compact term vector representations on the basis of a word indexing produced by the earlier known methods. For example, the initial term vectors may be mathematically reduced to a lower dimensionality using so-called latent semantic indexing. Another approach is to create a concept representation based on the occurrence of selected concept words. The latter approach is discussed in the master's thesis “Artificial Intelligence in an Online Newspaper”, Computer Science & Engineering at Linköping Institute of Technology, Sweden, 2000 by Löndahl et al. and in the international patent application WO00/63837. A feature common to the above methods is that they all result in a document concept distribution, i.e. a weighted list of concept components where the number of concepts is much smaller than the total number of terms. Systems based on such methods may be used to find relationships between documents that do not share the same words.

Other examples of research related to the field of the present invention are methods for finding semantic relationships between words. Such relationships are interesting to reveal, for instance, when performing word disambiguation and when creating thesauruses automatically. Word disambiguation constitutes a considerable challenge in natural language processing and involves deducing the contextual meaning of an ambiguous word, such as “bank”, which has a different meaning depending on whether the context is money or river. Most of the previously proposed methods are based on term co-occurrence calculations, i.e. term relationships being calculated based on the frequency at which terms co-occur in the same documents. Research has also been conducted to find a conceptual representation for words based on word proximity in a document corpus. U.S. Pat. No. 5,325,298 discloses methods for generating or revising context vectors for a plurality of word stems. The representation thus found may be used to generate the conceptual representation of documents in the document corpus.

Although many of today's most advanced information retrieval systems are generally capable of providing an accurate and comparatively relevant search result, there still remains progress to be made in this area. For instance, explicit term-to-term relationships cannot be expressed. Thus, even though some of the known methods manage to find documents that include terms which are synonymous (or by other means equivalent) to a user's search terms, they fail to explain why these documents were encountered. Another problem of the prior-art methods is that the quality of the search result is always limited to an upper boundary given by the accuracy of the user's search query. Hence, a poor choice of search phrase inevitably produces a relatively poor search result.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to alleviate the problems above and thus provide an improved solution for processing digitized textual information based on explicit relationships between synonymous terms.

It is also an object of the invention to offer information retrieval with enhanced feedback, which exceeds a maximum result accuracy as given by an initial search phrase.

According to one aspect of the invention these objects are achieved by a method of processing digitized textual information as described initially, which is characterized by generating the term-to-concept vectors on the basis of the concept vectors. Then, based on the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between the terms in the document corpus. Finally, the processed textual information is derived from the term-term matrix.

An important advantage attained by the term-term matrix is that it provides accurate connections between synonymous terms and related expressions. This, in turn, constitutes a basis for accomplishing high quality document searches, i.e. searches in which highly relevant information is identified.

According to a preferred embodiment of this aspect of the invention, each document in the document corpus is associated with a document-concept matrix. The document-concept matrix represents at least one concept element whose relevance with respect to the document is described by a weight factor. The generation of each term-to-concept vector comprises the following steps. First, a term-relevant set of documents is identified in the document corpus (see below). Each document in this term-relevant set contains at least one copy of the term. Second, a term weight is calculated for the term in each of the documents in the term-relevant set. Third, the concept vector associated with each document in the term-relevant set is retrieved, provided that the term weight in that document exceeds a first threshold value. Fourth, a relevant set of concept vectors is selected, which includes all concept vectors where at least one concept component exceeds a second threshold value. Fifth, a non-normalized term-to-concept vector is calculated as the sum of all concept vectors in the relevant set. Finally, the non-normalized term-to-concept vector is normalized.

This sub-procedure is advantageous because it accomplishes adequate term-to-concept associations very efficiently. Furthermore, the procedure may be appropriately calibrated with respect to the application by means of the first and second threshold values.
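The steps of this sub-procedure can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `term_to_concept_vector`, the dictionary-based data layout, the default threshold values and the choice of the Euclidean norm for the final normalization are all assumptions made for illustration.

```python
import math


def term_to_concept_vector(term, term_weights, concept_vectors, t1=0.1, t2=0.2):
    """Hypothetical sketch of the term-to-concept sub-procedure.

    term_weights:    {doc_id: {term: weight}}  (assumed layout)
    concept_vectors: {doc_id: [component, ...]}, non-empty, equal length
    t1, t2:          first and second threshold values (assumed defaults)
    """
    # Steps 1-2: the term-relevant set and the term's weight in each document.
    relevant = {d: w[term] for d, w in term_weights.items() if term in w}
    # Step 3: retrieve concept vectors where the term weight exceeds t1.
    candidates = [concept_vectors[d] for d, w in relevant.items() if w > t1]
    # Step 4: keep vectors with at least one component exceeding t2.
    selected = [v for v in candidates if max(v) > t2]
    if not selected:
        return None
    # Step 5: component-wise sum of the selected concept vectors.
    total = [sum(components) for components in zip(*selected)]
    # Step 6: normalization (Euclidean norm is an assumption here).
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


term_weights = {
    "d1": {"peace": 0.5},
    "d2": {"peace": 0.05},          # below t1, so d2 is ignored
    "d3": {"peace": 0.4, "bank": 0.3},
    "d4": {"bank": 0.9},
}
concept_vectors = {
    "d1": [0.6, 0.8],
    "d2": [1.0, 0.0],
    "d3": [0.0, 1.0],
    "d4": [1.0, 0.0],
}
vector = term_to_concept_vector("peace", term_weights, concept_vectors)
```

Raising `t1` or `t2` in this sketch discards more documents, which is one way to picture the calibration mentioned above.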

According to another preferred embodiment of this aspect of the invention, the generation of the term-term matrix comprises the following steps. First, a term-to-concept vector is retrieved for each term in each combination of two unique terms in the document corpus. Second, a relation vector is generated, which describes the relationship between the terms in each combination of two unique terms. Each component in the relation vector is here equal to the lowest of the corresponding component values in the two term-to-concept vectors. Third, a relationship value is generated for each combination of two unique terms. The relationship value constitutes the sum of all component values in the corresponding relation vector. Finally, a matrix is generated, which contains the relationship values of all combinations of two unique terms in the document corpus.

The term-term matrix per se is a desirable result, since it forms a valuable source of synonymous words and expressions. Furthermore, the above-proposed sub-procedure is attractive because it produces the term-term matrix in a computationally efficient manner.
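As an illustration of the component-wise minimum and sum described above, the following Python sketch builds a small term-term matrix. The function names, the dictionary input and the choice to leave the diagonal at 0.0 (the text only covers combinations of two unique terms) are assumptions of this example, not part of the specification.

```python
from itertools import combinations


def relationship_value(u, v):
    """Sum of the relation vector, whose components are the pairwise minima."""
    relation_vector = [min(a, b) for a, b in zip(u, v)]
    return sum(relation_vector)


def term_term_matrix(term_to_concept):
    """term_to_concept: {term: [component, ...]} (assumed layout).

    Returns the sorted term list and a square matrix of relationship values.
    The diagonal is left at 0.0 as a placeholder.
    """
    terms = sorted(term_to_concept)
    index = {t: i for i, t in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for a, b in combinations(terms, 2):
        value = relationship_value(term_to_concept[a], term_to_concept[b])
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value
    return terms, matrix


terms, matrix = term_term_matrix({
    "car": [0.5, 0.5, 0.0],
    "auto": [0.4, 0.6, 0.0],
    "river": [0.0, 0.1, 0.9],
})
```

In this toy input, "auto" and "car" share most of their concept weight and therefore receive a much higher relationship value than either does with "river".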

According to still another preferred embodiment of this aspect of the invention, a statistical co-occurrence value is calculated between each combination of two unique terms in the document corpus. This value describes the conditional probability that a certain second term exists in a document provided that a certain first term exists in the document. The statistical co-occurrence value is then incorporated into the term-term matrix to represent lexical relationships between the terms in the document corpus. The term-term matrix is thus improved by means of a lexical relationship measure, which provides a desirable precision in many applications.
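The co-occurrence value can be pictured as a simple document count ratio. The Python sketch below is an assumption-laden illustration: documents are modelled as sets of terms, and the function name `cooccurrence` is hypothetical. How the value is combined with the conceptual relationship values in the term-term matrix is not specified here and is therefore not shown.

```python
def cooccurrence(documents, first, second):
    """Estimate P(second in document | first in document) by counting.

    documents: iterable of sets of terms (assumed representation).
    """
    with_first = [doc for doc in documents if first in doc]
    if not with_first:
        return 0.0
    with_both = sum(1 for doc in with_first if second in doc)
    return with_both / len(with_first)


documents = [
    {"bank", "money"},
    {"bank", "river"},
    {"bank", "money", "loan"},
    {"money"},
    {"money", "loan"},
]
p_money_given_bank = cooccurrence(documents, "bank", "money")  # 2 of 3 documents
p_bank_given_money = cooccurrence(documents, "money", "bank")  # 2 of 4 documents
```

The two directions generally differ, so a measure of this kind is asymmetric.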

According to yet another preferred embodiment of this aspect of the invention, the processed textual information is displayed in a format which is adapted for human comprehension, for instance a graphical format. Naturally, such a presentation format improves the chances of conveying high-quality information to a user.

According to another preferred embodiment of this aspect of the invention, the displaying step involves presentation of at least one document identifier specifying a document being relevant with respect to at least one term in a query, presentation of at least one term being related to a term in a query, and/or presentation of a conceptual distribution representing a conceptual relationship between two or more terms in the document corpus. The conceptual distribution is based on shared concepts, which are common to said terms.

All these pieces of information represent useful return data and are thus desirable in the information retrieval process.

According to another preferred embodiment of this aspect of the invention, the initial document corpus used to create the term-to-term matrix is pre-processed using different types of document filters in order to remove unwanted documents and thereby enhance relationship quality.

According to another preferred embodiment of this aspect of the invention, such a pre-processing document filter is based on a cluster algorithm. The algorithm identifies document clusters consisting of documents with high similarity. Using this information, a new document corpus is generated based on the initial corpus, with the difference that each document cluster is represented by only a reduced set of documents in the new corpus. This is done in order to enhance the term-to-term relationships by removing large sets of very similar documents, which could otherwise bias the result.

According to yet another preferred embodiment of this aspect of the invention, the initial document corpus used to create the term-to-term matrix is based on user interaction, where the user selects at least one concept or term. The document corpus is based on all documents related to the selected term or concept. This allows the user to find relationships within a certain area of interest.

According to still another preferred embodiment of this aspect of the invention, the displaying step involves presentation of at least one document identifier, which specifies a document being relevant with respect to at least one term in a query in combination with at least one user specified concept. This procedure may include two sub-steps where, in a first step, at least two concepts from the shared concepts in the conceptual distribution are presented to the user. In a second step, the user indicates which concept(s) the query shall be combined with in order to produce a more to-the-point result. This is advantageous since it both vouches for a user-friendly interaction and generates adequate return data.

According to yet another preferred embodiment of this aspect of the invention, the conceptual relationship between a first term and at least one second term is illustrated by means of a respective relevance measure, which is associated with the at least one second term in respect of the first term. The relevance measure thus indicates the strength of the link between the first and the second term. In most cases this link is asymmetric, i.e. the relevance measure in the opposite direction typically has a different value.

According to another preferred embodiment of this aspect of the invention, the strength of the conceptual relationship between two or more terms is visualized graphically. An advantageous effect thereof is that particular words and expressions being most closely related to each other may be found very efficiently.

According to still another preferred embodiment of this aspect of the invention, the processed textual information is displayed as a distance graph where each term constitutes a node. A node representing a first term is thus connected to one or more other nodes that represent secondary terms to which the first term has a conceptual relationship of at least a specific strength. The relevance measure between the first term and the second term is represented by the least number of node hops between them. This type of distance graph constitutes a first preferred example of a source for deriving a data output in the form of conceptual relationships between words and expressions.

According to another preferred embodiment of this aspect of the invention, the processed textual information is displayed as a distance graph in which each term constitutes a node. A node representing a first term is thus connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship. Furthermore, each connection is associated with an edge weight, which represents the strength of a conceptual relationship between the terms being associated with the neighboring nodes being linked via the connection in question. The relevance measure between the first term and a particular secondary term is represented by an accumulation of the edge weights being associated with the connections constituting a minimum number of node hops between the first term and the particular secondary term. This type of distance graph constitutes a second preferred example of a source for deriving a data output in the form of conceptual relationships between words and expressions.
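Both distance-graph variants can be illustrated with a breadth-first search, which finds a path with the least number of node hops. The Python sketch below is only an example under stated assumptions: the adjacency-dictionary layout and the function names are hypothetical, and the accumulation of edge weights is taken to be a plain sum, which the text does not specify. Where several minimum-hop paths exist, this sketch simply uses the first one found.

```python
from collections import deque


def min_hop_path(graph, start, goal):
    """Breadth-first search. graph: {node: {neighbor: edge_weight}} (assumed).

    Returns one path with the fewest hops as a list of nodes, or None.
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, {}):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def hop_count(path):
    """First variant: relevance expressed as the number of node hops."""
    return len(path) - 1


def accumulated_weight(graph, path):
    """Second variant: edge weights accumulated along the path (sum assumed)."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


graph = {
    "bank": {"money": 0.8, "river": 0.3},
    "money": {"bank": 0.8, "loan": 0.7},
    "river": {"bank": 0.3},
    "loan": {"money": 0.7},
}
path = min_hop_path(graph, "bank", "loan")
```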

According to yet another preferred embodiment of this aspect of the invention, each term in the document corpus represents either a single word, a proper name, a phrase, or a compound of single words.

According to another aspect of the invention these objects are achieved by a computer program directly loadable into the internal memory of a digital computer, comprising software for controlling the method described above when said program is run on a computer.

According to yet another aspect of the invention these objects are achieved by a computer readable medium, having a program recorded thereon, where the program is to make a computer perform the method described above.

According to still another aspect of the invention these objects are achieved by a search engine as described initially, which is characterized in that the processing unit in turn comprises a processing module and an exploring module. The processing module is adapted to receive the term-to-concept vectors for the document corpus. Based on the term-to-concept vectors, the processing module generates a term-term matrix, which describes a term-to-term relationship between the terms in the document corpus. The exploring module is adapted to receive the query and the term-term matrix. Based on this input, the exploring module processes the term-term matrix and generates the processed textual information.

This search engine is advantageous, since it is capable of identifying relationships between synonymous words and expressions, which typically cannot be found by the prior-art search engines. As a further consequence of the proposed search engine, relevant documents and information can be retrieved that would otherwise have been missed.

According to still another aspect of the invention these objects are achieved by a database as described initially, which is characterized in that it is adapted to deliver the term-to-concept vectors to the proposed search engine. A database where the information has this format is desirable, since it shortens the average response time considerably for a search performed according to the proposed principle.

According to a preferred embodiment of this aspect of the invention, the database comprises an iterative term-to-concept engine, which is adapted to receive fresh digitized textual information to be added to the database. Based on the added information, the iterative term-to-concept engine generates concept vectors for any added document, and generates a term-to-concept vector, which describes a relationship between any added term and each of the concept vectors. An important advantage provided by the iterative term-to-concept engine is that it allows information updates without requiring a complete rebuilding of the concept vectors and the term-to-concept vectors.

According to still another aspect of the invention these objects are achieved by a server as described initially, which is characterized in that it comprises the proposed search engine and a communication interface towards the proposed database. This server thus makes searches according to the proposed method possible.

According to still another aspect of the invention these objects are achieved by a system as described initially, which is characterized in that it comprises the above-proposed server, at least one user client adapted to communicate with the server, and a communication link connecting the at least one user client with the server. Preferably, at least a part of the communication link is accomplished over an internet (e.g. the public Internet) and the user client comprises a web browser. This browser in turn provides a user input interface via which a user may enter queries to the server. The web browser also receives processed textual information from the server and presents it to a user. Hence, an expedient remote access is offered to the information in the database.

Based on an amount of textual data being organized in a document corpus and a method for classifying documents on a conceptual level, the invention thus provides a solution for generating a conceptual representation of all terms in the amount of data on the basis of the terms' occurrence in documents and the documents' conceptual classification. A linkage between terms may thereby be expressed by means of a similarity measure. This, in turn, is accomplished by identifying the mutual conceptual representations of term combinations followed by a computation of a statistical measure for term co-occurrence. A term-to-term relationship matrix may thus be established. This matrix describes both a conceptual and a lexical similarity between the terms. Moreover, the matrix may be presented graphically, either as a conventional graph or as a relationship network, which is made suitable for human comprehension.

The proposed conceptual representations and relationships allow sophisticated information retrieval operations to be performed, such as finding related terms, identifying subject matter being common to certain terms and visualizing term relationships. Furthermore, documents being relevant to one or more terms may be retrieved and filtered based on their conceptual representations.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is now to be explained more closely by means of preferred embodiments, which are disclosed as examples, and with reference to the attached drawings.

FIG. 1 shows a system for providing data processing services according to an embodiment of the invention,

FIG. 2 illustrates, by means of a flow diagram, an indexing pre-processing procedure according to an embodiment of the invention,

FIG. 3 shows a flow diagram, which provides an overview of a method performed by a proposed processing module,

FIGS. 4a-c illustrate a sequence according to an embodiment of the invention in which term-to-term relationships are established,

FIG. 5 illustrates, by means of a flow diagram, a method for generating a term-document matrix according to an embodiment of the invention,

FIG. 6 illustrates, by means of a flow diagram, a method for updating a document corpus with added data according to an embodiment of the invention,

FIGS. 7a-b illustrate how a term-to-term relationship may be established according to an embodiment of the invention,

FIG. 8 illustrates, by means of a flow diagram, a method for generating a term-term matrix according to an embodiment of the invention,

FIGS. 9a,b illustrate, by means of flow diagrams, alternative methods for enhancing the relationship quality, according to embodiments of the invention,

FIG. 10 illustrates, by means of a flow diagram, the operation of a proposed exploring module according to an embodiment of the invention,

FIG. 11 illustrates, by means of a flow diagram, a method for finding biased information according to an embodiment of the invention,

FIG. 12 shows an example of a term-term matrix, which is displayed as a relationship network according to an embodiment of the invention,

FIG. 13 shows a flow diagram, which summarizes the proposed method for processing digitized digital information,

FIG. 14 shows a flow diagram, which summarizes a first preferred embodiment of the proposed method for processing digitized digital information, and

FIG. 15 shows a flow diagram, which summarizes a second preferred embodiment of the proposed method for processing digitized digital information.

DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

The following definitions are made with respect to the disclosure of the present invention.

Document

Unless otherwise stated, by “document” is meant any textual piece of information written in any language, for instance, an entire text document, a particular part of a document, a document preamble, a paragraph or another sub-part of a text. In addition to the actual information text (“payload”), a document may include meta information, such as data designating language, author, creation date, images, links, keywords, sounds, video clips etc.

Proper Name

Unless otherwise stated, the expression “proper name” is understood as one or more nouns that designate a particular entity (being or thing). Normally, a “proper name” does not include a limiting modifier and in most English-language cases it is written with initial capital letters. An example of a proper name is “Capitol Hill”.

Term

Unless otherwise stated, a “term” refers to a single word, a phrase, a proper name, a compound word or another multi-word structure.

Concept

Unless otherwise stated, by “concept” is meant an abstract or a general idea inferred or derived from specific instances. Usually, a concept may be described by a single word, such as politics.

Document Corpus

Unless otherwise stated, the expression “document corpus” refers to a collection of documents, such as a text archive, a news feed or an article database. A commonly referenced document corpus is the Reuters-21578 Text Categorization Test Collection (www.research.att.com/~lewis/reuters21578.html).

The present invention relates generally to the field of information retrieval solutions for information exploration. Information exploring here refers to the capability of providing user assistance in extracting specific subsets of information from a larger amount of information. Information exploring also implies finding relations in a given amount of information. According to the invention, this can be accomplished without the use of Boolean search queries, which is otherwise the standard procedure when working with information retrieval systems.

The core functionality of the proposed solution is based on a conceptual representation of the terms used in a document corpus and the conceptual relationships between the terms. Based on such relationships, a user can select one or more of the terms and be presented with related material. Specifically, the proposed system is capable of presenting related terms, related documents as well as graphical summaries of the selected terms.

Furthermore, by using the generated conceptual relationships, the system is able to graphically display how different pieces of information are related to each other and thereby allow a user to navigate through the information. For example, the relationships between terms can be illustrated by presenting their mutual concepts in a pie chart or by presenting graphical networks of the term relationships. Navigation through the information is enabled by allowing the user to interact with the graphical display of relationships, such as selecting (e.g. by mouse-clicking) a concept in a concept pie chart and thereby exclusively obtaining material related to the selected concept.

FIG. 1 shows a system for providing data processing services according to an embodiment of the invention. Digitized textual information, which is presumed to be entered into the system as a document corpus, is stored in a database 130. A server 110 is connected to the database 130 via a communication interface 112. At least one user client 120 may in turn gain access to services provided by the server 110 over a network 140, such as the Internet.

The server 110 contains a search engine 115, which includes a processing unit 150. A processing module 151 in the processing unit 150 transforms the documents (i.e. the digitized textual information in the document corpus of the database 130) into a number of conceptual relationship maps, which describe various relationships in the document corpus.

A user may interact with the system via a user input interface 121a in the user client 120, for example by entering a query Q. The query Q is forwarded to the server 110 over a first communication link 141 and an interface 116. Based on the user's interaction with the system, for instance, choosing a certain term in the query Q, an exploring module 152 extracts relevant processed textual information R, which is produced from the relationships generated by the processing unit 150. The processed textual information R is then returned to the user client 120 via a second communication link 142 and presented to the user via a user output interface 121b. Preferably, the information R is displayed in a graphical format that allows further interaction with the information R.

FIG. 2 illustrates, by means of a flow diagram, an indexing pre-processing procedure according to an embodiment of the invention. This procedure may be performed by a proposed indexing engine 320, which will be described further below with reference to FIGS. 3 and 6. The pre-processing involves extracting all terms included in an unformatted text and assigning weights to each of the terms based on their information content. A list of terms and a term-document matrix (TDM) are generated as a result of this indexing.

The TDM is an N×M matrix containing M vectors of dimensionality N, where N represents the number of unique terms in the document corpus (usually approximately equal to the number of words in the language of the document corpus) and M represents the number of documents in the corpus. Each vector component in the TDM contains a weight in the interval [0,1], which indicates the importance of a term in a document, or vice versa, i.e. the importance of a document to a given term.

The indexing pre-processing procedure includes the following steps. A first step 210 performs word splitting. This means that the text is split into a number of words, based on an “allowed character” rule. The definition of what is an “allowed character” depends on the language. Usually, at least all characters included in the language's alphabet(s) are allowed. A character in the text which is not contained in the set of allowed characters results in a word split. Typically, a word split is performed when a space character occurs.
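The “allowed character” rule of step 210 can be sketched in a few lines of Python. The allowed set used here (the ASCII letters) and the function name `split_words` are assumptions for illustration; a real system would choose the set per language.

```python
import string

# Assumption: the allowed characters are the ASCII letters only.
ALLOWED_CHARACTERS = set(string.ascii_letters)


def split_words(text, allowed=ALLOWED_CHARACTERS):
    """Split text at every character that is not in the allowed set."""
    words, current = [], []
    for character in text:
        if character in allowed:
            current.append(character)
        elif current:
            # A disallowed character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


words = split_words("Tony Blair's peace-process, 2001!")
```

With this allowed set, apostrophes, hyphens, digits and punctuation all act as split points, just as the space character does.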

Subsequently, a step 220 performs proper name identification in the text. The step 220 thus identifies compound terms consisting of two or more terms, such as “Bill Clinton”. The treatment of each proper name as a single term lowers the error rate in the information retrieval process, since ambiguities are thereby reduced. An example of an ambiguity that will occur unless proper name identification is performed is that between “Carl Lewis” and “Lennox Lewis”. Here, the term “Lewis” would erroneously cause the search engine 115 to judge a document containing “Carl Lewis” and another containing “Lennox Lewis” to be related to each other.

After that, a step 230 removes any stop words in the text. Some terms have little or no importance to the content of a text. Preferably, such insignificant terms are removed according to a language-specific stop word list. The words “the”, “a”, “is” and “are” in the English language constitute typical examples of stop words to be removed.
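Stop word removal in step 230 amounts to filtering against a list. In the Python sketch below, the stop word list contains only the four examples given in the text, and the case-insensitive comparison and the function name `remove_stop_words` are assumptions of this example.

```python
# Only the four example stop words from the text; a real list would be longer.
STOP_WORDS = {"the", "a", "is", "are"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word found in the stop word list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


filtered = remove_stop_words(["The", "banks", "are", "near", "a", "river"])
```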

Then, a step 240 applies a stemming algorithm. This algorithm ensures that different forms of words that have the same word stem are treated as a single term. Naturally, the stemming algorithm must be language-specific and it is applied to all words in the text. The algorithm removes any word suffixes and transforms the words into their common word stem. A commonly used algorithm for stemming in an English text is the Porter stemming algorithm. Based on the principles behind this algorithm, the person skilled in the art may design a stemming algorithm for any other language.

Following the step 240, a step 250 performs term weighting of the words in the text. Thereby, each unique term in each document is assigned a weight according to its information content. The so-called Term Frequency times Inverse Document Frequency (TFIDF) is a commonly used method for this. According to a preferred embodiment of the invention, the information content in a document is determined by using an extension to the traditional TFIDF term weighting scheme. Specifically, a term position parameter p(t,d) (which will be explained below) is added to each term.

A certain term t in a document d is thus allocated a weight w(t,d) according to:

$w(t,d) = \frac{n(t,d)}{n(d)} \cdot \left( -\log \frac{N(t,D)}{N(D)} \right) \cdot p(t,d)$

where

- n(t,d) is the number of occurrences of the term t in the document d,
- n(d) is the total number of terms in the document d,
- N(t,D) is the number of documents in which the term t exists,
- N(D) is the total number of documents in the document corpus, and
- p(t,d) is a domain specific weight function dependent on the positions of the term t in the document d.

The parameter p(t,d) is used to increase the importance of a term occurring in, for instance, the title or preamble of a document. For example, a term occurring in the headline may have p(t,d)=3.0, while it has p(t,d)=1.0 when occurring in the body text.
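A minimal Python sketch of this weighting follows. It assumes that documents are lists of already pre-processed terms, that the logarithm is the natural logarithm (the base is not stated), and that the position weight p(t,d) is supplied by the caller; the function name `term_weight` is hypothetical.

```python
import math


def term_weight(term, document, corpus, position_weight=1.0):
    """w(t,d) = n(t,d)/n(d) * (-log(N(t,D)/N(D))) * p(t,d).

    document: list of terms; corpus: list of such documents (assumed layout).
    position_weight stands in for p(t,d), e.g. 3.0 for a headline term.
    """
    n_td = document.count(term)                        # n(t,d)
    n_d = len(document)                                # n(d)
    if n_td == 0 or n_d == 0:
        return 0.0
    n_docs_with_term = sum(1 for d in corpus if term in d)  # N(t,D)
    n_docs = len(corpus)                               # N(D)
    return (n_td / n_d) * -math.log(n_docs_with_term / n_docs) * position_weight


corpus = [
    ["peace", "talks", "peace"],
    ["bank", "talks"],
    ["river", "bank"],
    ["talks"],
]
body_weight = term_weight("peace", corpus[0], corpus)
headline_weight = term_weight("peace", corpus[0], corpus, position_weight=3.0)
```

In the example, "peace" occurs twice among three terms and in one of four documents, so the body weight is (2/3)·ln 4, and the headline weight is three times that.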

Finally, a step 260 normalizes the vectors in the TDM. Preferably, the normalization is performed according to the Euclidean norm. Thus, for a term t_(i) in a document d_(k) (i.e. position (i,k) in the term-document matrix) the normalized weight w(t_(i),d_(k)) is given by:

$w(t_i,d_k) = \frac{w(t_i,d_k)}{\sqrt{\sum\limits_{j=1}^{N} w(t_j,d_k)^2}}.$
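The Euclidean normalization recurs throughout the description, so a small helper may make it concrete. The name `normalize` and the handling of an all-zero vector (returned unchanged) are assumptions of this sketch.

```python
import math


def normalize(vector):
    """Scale a weight vector to unit Euclidean length; an all-zero vector is returned as is."""
    norm = math.sqrt(sum(w * w for w in vector))
    if norm == 0.0:
        return list(vector)
    return [w / norm for w in vector]


print(normalize([3.0, 4.0]))
```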

FIG. 3 shows a flow diagram, which provides an overview of the method performed by the processing module 151 in FIG. 1. The processing module 151 performs a number of processing steps and calculations in order to generate relationship matrices that describe various relationships within the document corpus. In this context, a relationship is indicated by a numeric value, which describes for example the similarity between two terms in the document corpus. The figure shows a set of engines 320, 340, 360 and 380 and illustrates how these together process the various data types according to the invention.

A document corpus 310 containing at least one document is presumed to be entered in a digital format and thereafter stored in a computer memory storage system, such as the database 130 in FIG. 1. An indexing engine 320 extracts every term found in the document corpus 310, preferably according to the indexing pre-processing procedure described with reference to FIG. 2 above. The indexing engine 320 also assigns weights to the extracted terms (step 250 in FIG. 2), which specify the terms' information importance relative to the document in which they occur.

A document-concept matrix (DCM) 390 describes how the documents in the document corpus 310 are related to concepts. Each document in the corpus 310 is thereby described by a normalized vector in the DCM 390, which denotes a distribution of concepts describing the particular document. For instance, in a news domain a document titled “Tony Blair attempts to save the peace-process in Northern Ireland” would typically have a concept distribution that indicates high relationships to the concepts “UK”, “Northern Ireland”, “Negotiations” and “Government”.

A term-document matrix (TDM) 330 describes how terms occur in documents. Each unique term in the document corpus 310 has a normalized vector in the TDM 330, which denotes a distribution of documents that contain the term and the term's importance in these documents. In the art of information retrieval this matrix is commonly referred to as an inverted index.

A term-concept matrix engine 340 receives the DCM 390 and the TDM 330, and on basis thereof generates a matrix of vectors, which contains weight values representing relationships between terms and concepts. In the DCM 390, each document is associated with a concept vector via different weight values, and in the TDM 330, each term has a weighted value with respect to each document vector in which it occurs.

The matrix produced by the engine 340 is an N*M dimensional array of normalized term vectors, which each contains a set of weight values. N here represents the number of unique terms in the document corpus and M represents the number of concepts.

The weight value lies in the interval [0,1] and indicates how closely a term is associated with a particular concept, based on the context in which the term has appeared. A high weight thus indicates a close relationship. For example, the term “NHL” is likely to have a high relationship with the concept “Hockey”. The procedure according to which the term-to-concept relationships are generated will be further illustrated below with reference to FIGS. 4 a-c.

A term-concept matrix (TCM) 350 describes how the terms are related to concepts. Each unique term in the corpus 310 has a normalized vector in the TCM 350, which denotes a distribution of concepts describing the term. For instance, in a news domain the term “Bill Clinton” would typically have a concept distribution indicating the concepts “President”, “Government” and “US”.

A term-term matrix engine 360 receives the TDM 330 and the TCM 350, and on basis thereof generates a term-term matrix 370, which contains vectors that describe conceptual relationships between the terms.

The term-term matrix (TTM) 370 describes how each term is related to each of the other terms in the corpus 310. Hence, each unique term in the corpus 310 has an entry in the TTM 370, which denotes a distribution vector of terms being related to the term. For instance, in a news domain the term “Bill Clinton” would typically have a term distribution including “George Bush”, “Al Gore” and “Hillary Clinton”.

A document-concept matrix engine 380 is used to generate conceptual representations of any new documents being entered into the system, either at system start-up when a complete document corpus 310 is entered or when updating the corpus 310 with one or more added documents. A preferred procedure for accomplishing such information update is described below and with further reference to FIG. 6. However, any alternative method known from the prior art may equally well be used. In any case, the engine 380 updates the DCM 390 based on the TDM 330 and the TCM 350.

The document-concept matrix engine 380 produces the conceptual distribution for a document, i.e. a description of the relationships between the document and all concepts in the corpus 310. In essence, the documents are processed by means of algorithms that find a conceptual document description. This description has the property that documents which relate to the same topics, or basically have the same semantic meaning, will receive a similar conceptual description. Any of the prior-art methods for generating conceptual descriptions of documents may be used for this, provided that the result thereof can be expressed as a DCM, where each row is a normalized document vector, which denotes a distribution of concepts describing each document in the document corpus 310.

Formally, the engine 380 calculates, for each document D_(i) and concept C_(j), a document-concept relationship value rdc(D_(i),C_(j)) according to:

$rdc(D_i,C_j) = \frac{rdc(D_i,C_j)}{\sqrt{\sum\limits_{l=1}^{M} rdc(D_i,C_l)^2}}$

and forms a matrix with the relationship values rdc(D_(i),C_(j)) as elements, where each element (i, j) in the matrix contains the row-wise normalized rdc(D_(i),C_(j)) value.

Due to the normalization, the range of the rdc(D,C) is [0,1]. A value close to 1 thus indicates a close conceptual relationship between the document and a concept, while a value close to 0 indicates no or an insignificant relationship.

FIGS. 4 a-c illustrate a sequence according to an embodiment of the invention in which term-to-term relationships are established. A set of documents 411-414 in a document corpus are presumed to be related to a number of concepts 420-424 as illustrated by the arrows. Furthermore, a first term 431 (“Carl Bildt”) and a second term 432 (“Tony Blair”) are weighted in all documents 411, 412 in which they occur (see FIG. 4 b). Based on the fact that the terms 431, 432 are related to the documents 411, 412 and the documents 411, 412 in turn are related to the concepts 421-423, the term-concept matrix engine (340 in FIG. 3) is able to compute term-to-concept relationships between the first term 431 (“Carl Bildt”) and a first concept 421 (“Kosovo”) as shown in FIG. 4 c.

In this example, the first term 431 (“Carl Bildt”) occurs in a first document 411 and in a second document 412. The first document 411 is in turn related to a first concept 421 (“Kosovo”) and the second concept 422 (“UN”), while the second document 412 is only related to the second concept 422 (“UN”). Thus, the first term 431 (“Carl Bildt”) is related to both the first concept 421 (“Kosovo”) and the second concept 422 (“UN”), the relationship to the second concept 422 (“UN”) however being stronger.

A more exact description of this algorithm is given below with reference to FIG. 5. Here, a flowchart illustrates the different operations performed by the term-concept matrix engine (340 in FIG. 3) and how they interact with each other. Based on the DCM 390, the processing starts in a step 510 by iterating over all unique terms in the document corpus (310 in FIG. 3). A step 520 performs, for each term t_(i), a second iteration over all concepts. The algorithm thus traverses all positions in the resulting TCM (350 in FIG. 3). A step 530 calculates a relation value rtc(t_(i),c_(j)) for a given term t_(i) and a given concept c_(j), according to:

$rtc(t_i,c_j) = \sum\limits_{\{k \mid t_i \in d_k\}} w(t_i,d_k) \cdot rdc(d_k,c_j).$

The sum is computed over all documents containing a term t_(i). The factor w(t_(i),d_(k)) represents a weighted value for the term t_(i) in a document d_(k) as computed by the indexing engine (320 in FIG. 3). The factor rdc(d_(k),c_(j)) is a value that describes a relationship between the document d_(k) and the concept c_(j) as specified in the DCM (390 in FIG. 3). According to a preferred embodiment of the invention, all documents having a w(t_(i),d_(k))-value below a first threshold (see step 1430 in FIG. 14) and each document having all its rdc(d_(k),c_(j))-values below a second threshold (see step 1440 in FIG. 14) are ignored. This reduces the noise and thus ensures that a term's conceptual representation is exclusively based on those documents where the term has a particular significance, and where the documents in turn can be described by a comparatively distinct conceptual representation.

The resulting sum represents a weighted relationship between a certain term and a certain concept. In a step 540, the rtc sum is normalized using the Euclidean norm:

$rtc(t_i,c_j) = \frac{rtc(t_i,c_j)}{\sqrt{\sum\limits_{l=1}^{M} rtc(t_i,c_l)^2}}.$

The normalized rtc-values for a specific term are stored in the TCM (350 in FIG. 3) at their respective positions (i, j), thus forming a normalized term-to-concept row-vector at row i. The document-concept engine 380 iteratively updates the DCM (390 in FIG. 3) accordingly.
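The steps 510 to 540 can be sketched for a single term as follows. This is a minimal illustration under stated assumptions: the function name `term_concept_vector`, the dictionary-based data layout, the threshold parameter names `w_min` and `rdc_min`, and all sample weights are hypothetical. The sample data mirrors the “Carl Bildt” example, with concept columns ordered as (“Kosovo”, “UN”).

```python
import math


def term_concept_vector(term_doc_weights, dcm, w_min=0.0, rdc_min=0.0):
    """Compute one normalized TCM row.

    term_doc_weights: {document id: w(t_i, d_k)} for the documents containing the term.
    dcm: {document id: [rdc(d_k, c_1), ..., rdc(d_k, c_M)]}.
    """
    num_concepts = len(next(iter(dcm.values())))
    rtc = [0.0] * num_concepts
    for doc, weight in term_doc_weights.items():
        concepts = dcm[doc]
        if weight < w_min:
            continue  # the term is insignificant in this document (first threshold)
        if all(value < rdc_min for value in concepts):
            continue  # the document lacks a distinct concept profile (second threshold)
        for j, rdc in enumerate(concepts):
            rtc[j] += weight * rdc
    norm = math.sqrt(sum(value * value for value in rtc))
    return [value / norm for value in rtc] if norm > 0 else rtc


# Hypothetical weights: document 411 relates to both concepts, document 412 only to "UN".
dcm = {"d411": [0.6, 0.8], "d412": [0.0, 1.0]}
carl_bildt = term_concept_vector({"d411": 0.5, "d412": 0.5}, dcm)
print(carl_bildt)
```

As in the example of FIGS. 4 a-c, the resulting vector relates the term to both concepts, with the stronger component for “UN”.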

FIG. 6 illustrates, by means of a flow diagram, a method for updating a document corpus with added data according to an embodiment of the invention. When the TCM 350 has been generated, it can be used to iteratively assign a conceptual distribution to new, previously unknown terms appearing in an added document.

In a first step 610, a document d_(k) enters the indexing engine 320 where it is processed. For terms t_(i) (where i=1, . . . , m) with an existing conceptual distribution, a step 620 retrieves the distribution row vector from the TCM 350. The step 620 also retrieves a corresponding weight value for the term t_(i) in the document d_(k) from the TDM 330.

Term-to-concept vectors are then calculated for each added and previously unknown term t_(j) (where j=m+1, . . . , n) by iterating over all concepts (step 640). For each concept c_(s), a step 650 calculates its cumulative weight rtc(t_(new),c_(s)) in the document d_(k) according to:

$rtc(t_{new},c_s) = \sum\limits_{i=1}^{m} rtc(t_i,c_s) \cdot rtd(t_i,d_k).$

A step 670 then assigns the cumulative weight rtc(t_(new),c_(s)) for the concept c_(s) to each of the previously unclassified terms (step 660) in the added document d_(k).

The term-to-concept relationship values for the added terms t_(j) are finally normalized using the Euclidean norm in a step 680. The normalized rtc-values for term t_(j) are stored in the TCM 350 at their respective positions (j, s), thus forming a normalized term-to-concept row-vector at row j.
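The update procedure of FIG. 6 may be sketched as below. The sketch assumes that rtd(t_(i),d_(k)) denotes the weight of the known term t_(i) in the added document d_(k), as retrieved from the TDM in the step 620. The function name `new_term_concept_vector`, the data layout and the sample values are hypothetical.

```python
import math


def new_term_concept_vector(known_rows, doc_weights):
    """Derive a normalized concept vector for a previously unknown term.

    known_rows: {known term: its TCM row [rtc(t_i, c_1), ..., rtc(t_i, c_M)]}.
    doc_weights: {known term: its weight in the added document (assumed to be rtd)}.
    """
    num_concepts = len(next(iter(known_rows.values())))
    cumulative = [0.0] * num_concepts
    for term, row in known_rows.items():
        weight = doc_weights[term]
        for s, rtc in enumerate(row):
            cumulative[s] += rtc * weight  # step 650: accumulate per concept
    norm = math.sqrt(sum(value * value for value in cumulative))
    return [value / norm for value in cumulative] if norm > 0 else cumulative  # step 680


# Hypothetical added document containing two known terms besides the new one.
known_rows = {"Kosovo": [1.0, 0.0], "UN": [0.0, 1.0]}
doc_weights = {"Kosovo": 0.6, "UN": 0.8}
print(new_term_concept_vector(known_rows, doc_weights))
```

Every previously unknown term in the added document receives the same vector, since the sum only depends on the known terms of that document.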

The term-term matrix engine (360 in FIG. 3) generates an N*N relationship matrix of all terms in the document corpus, where N is the number of unique terms in the corpus. A relationship value in the interval [0,1] is generated from each term to every other term. The generation of the term-term matrix uses the TCM in conjunction with a term co-occurrence calculation, which is described below with reference to FIGS. 7 a-b. The merit of combining the two methods is that both conceptual and lexical similarities can thereby be described with a single similarity measure.

The idea of using the TCM (which may also be regarded as a network, see FIG. 11) in order to find relationships between terms will now be elucidated. Based on relationships between a set of terms 431-434 and a set of concepts 420-424, term-to-term relationships can be generated by identifying mutual, or shared, concept components. As an example, a first term 431 (“Carl Bildt”) and a sixth term 436 (“Bill Clinton”) would be conceptually related, since they are both related to a first concept 421 (“Kosovo”) and a second concept 422 (“UN”), see the bold lines in FIG. 7 b.

FIG. 8 illustrates, by means of a flow diagram, a method for generating a term-term matrix according to an embodiment of the invention. Two initial steps 810 and 820 in combination with two loop-back steps 841 and 861 respectively accomplish a double iteration over all unique terms t_(i)≠t_(j) in the document corpus. Thereby, a relation value is generated which describes the relationships between any specific term and each of the other terms.

For each pair of terms t_(i) and t_(j), a step 830 calculates an rttc(t_(i),t_(j))-value as the sum of the lowest term-concept relationship values over all concepts. This corresponds to the expression:

$rttc(t_i,t_j) = \sum\limits_{k=1}^{m} \min\left( rtc(t_i,c_k),\, rtc(t_j,c_k) \right)$

where

-   c_(k) specifies a certain concept,
-   m represents the total number of concepts, and
-   rtc(t,c) is the relationship value defined in the TCM as described above.

The minimum-function produces the effect that the conceptual relationships are here defined by the mutual concepts for the terms. All the iterations (steps 810 and 820) result in a description of the conceptual relationships between all terms in the form of a primary term-to-term matrix.
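The rttc calculation and the double iteration over all term pairs can be sketched as follows. The function names `rttc` and `primary_term_term_matrix`, the three-concept vectors and their values are assumptions made for illustration only.

```python
def rttc(row_i, row_j):
    """Sum of the component-wise minima of two term-to-concept vectors."""
    return sum(min(a, b) for a, b in zip(row_i, row_j))


def primary_term_term_matrix(tcm):
    """Double iteration over all pairs of distinct terms (steps 810 and 820)."""
    return {
        t_i: {t_j: rttc(tcm[t_i], tcm[t_j]) for t_j in tcm if t_j != t_i}
        for t_i in tcm
    }


# Hypothetical TCM rows over three concepts.
tcm = {
    "Carl Bildt": [0.6, 0.8, 0.0],
    "Bill Clinton": [0.0, 0.6, 0.8],
    "NHL": [0.0, 0.0, 1.0],
}
matrix = primary_term_term_matrix(tcm)
print(matrix["Carl Bildt"])
```

Two terms without any shared concept, such as “Carl Bildt” and “NHL” in this sample data, obtain the value zero, which reflects that only mutual concepts contribute.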

In order to improve the precision of this matrix, the relationship values between terms are enhanced in a step 840 based on their statistical co-occurrence in the document corpus. Two terms are defined as co-occurring if they are found in the same document(s). A co-occurrence value rtto(t_(i),t_(j)) is generated, based on the dependent probability p(t_(i)∈d_(k)|t_(j)∈d_(k)) that a certain term t_(i) exists in a document d_(k) chosen at random, provided that t_(j) exists in d_(k). This definition is equivalent to the expression:

$rtto(t_i,t_j) = p(t_i \mid t_j) = \frac{p(t_i \cap t_j)}{p(t_j)}$

The probabilities above are easily calculated using the TCM. For example, in a certain document corpus the term “NHL” and the term “hockey” may co-occur in 5% of the documents. In the same corpus, the term “NHL” is presumed to occur in 10% of the documents. The dependent probability of finding the term “hockey” given the term “NHL” is thus 0.05/0.10=0.5. In other words, the co-occurrence between “NHL” and “hockey”, i.e. the rtto-value, is rtto(“hockey”,“NHL”)=0.5.
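The co-occurrence value can be illustrated by counting documents directly. In this sketch each document is reduced to a set of terms, which is an assumption about the data layout, and the function name `rtto` simply mirrors the symbol in the text. The 20-document sample corpus is hypothetical and constructed so that “NHL” occurs in 10% of the documents and co-occurs with “hockey” in 5%.

```python
def rtto(term_i, term_j, documents):
    """Dependent probability p(t_i | t_j): share of the documents with t_j that also hold t_i."""
    with_j = [doc for doc in documents if term_j in doc]
    if not with_j:
        return 0.0
    both = sum(1 for doc in with_j if term_i in doc)
    return both / len(with_j)


# Hypothetical corpus of 20 documents, each represented as a set of terms.
docs = (
    [{"NHL", "hockey"}, {"NHL"}]
    + [{"hockey"} for _ in range(3)]
    + [{"weather"} for _ in range(15)]
)
print(rtto("hockey", "NHL", docs))  # 1 of the 2 "NHL" documents also contains "hockey"
print(rtto("NHL", "hockey", docs))  # 1 of the 4 "hockey" documents also contains "NHL"
```

The two calls give different values for the same pair of terms, since the measure depends on which term is taken as given.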

In a step 850, the two term-term relationship metrics are then combined into a final term-term relationship value rtt, which replaces the initial rttc-value in the primary term-to-term relationship matrix according to:

$rtt(t_i,t_j) = \alpha \cdot rtto(t_i,t_j) + \beta \cdot rttc(t_i,t_j)$

where α and β represent a first and a second constant, which define the importance of the rtto- and rttc-values respectively. The choice of α and β thus controls the influence of lexical and conceptual relationships in the final term similarity measure.

Both the constants α and β may be chosen arbitrarily, since the rtt-values are normalized using the Euclidean norm in a following step 860. The matrix is normalized row-wise for a row i as follows:

$rtt(t_i,t_j) = \frac{rtt(t_i,t_j)}{\sqrt{\sum\limits_{l=1}^{N} rtt(t_i,t_l)^2}}$

where N is the total number of unique terms in the document corpus. As a result, the term-term matrix 370 is produced.
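The combination in the step 850 and the row-wise normalization in the step 860 may be sketched for one matrix row as follows. The function name `combine_row`, the default values of `alpha` and `beta` and the sample rows are assumptions of this sketch.

```python
import math


def combine_row(rtto_row, rttc_row, alpha=0.5, beta=0.5):
    """Combine lexical (rtto) and conceptual (rttc) values of one row and normalize the row."""
    rtt = [alpha * o + beta * c for o, c in zip(rtto_row, rttc_row)]
    norm = math.sqrt(sum(value * value for value in rtt))
    return [value / norm for value in rtt] if norm > 0 else rtt


# Hypothetical row values for one term against two other terms.
row_a = combine_row([0.5, 0.25], [0.6, 0.0], alpha=1.0, beta=2.0)
row_b = combine_row([0.5, 0.25], [0.6, 0.0], alpha=2.0, beta=4.0)
print(row_a)
print(row_b)
```

The two rows coincide up to rounding: after the normalization only the ratio between α and β influences the result, not their absolute magnitudes.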

Please note that the co-occurrence value is based on a non-symmetric function, i.e. typically rtto(t_(i),t_(j))≠rtto(t_(j),t_(i)). In most cases, the term-term relationship matrix is hence non-symmetric. This corresponds to the case where a first term has a strong relationship to a second term, without however the second term having a strong relationship to the first term. For example, the term “Mike Tyson” may have a very strong relationship to the term “boxing” whilst the term “boxing” is only weakly related to the term “Mike Tyson”.

FIG. 9 a illustrates, by means of a flow diagram, one method for enhancing the relationship quality by filtering the document corpus used to generate the term-term matrix. The method involves three main steps in the form of an initial step 910 in which a Document Corpus is identified, a subsequent filtering step 920 in which the number of similar documents in the document corpus is reduced, and a final step 930 wherein a new Document Corpus is generated. Reducing the number of similar documents in the corpus ensures that large quantities of similar documents will not bias the relationship measures. For example, if one single event is described in ten different documents, terms occurring in these documents will tend to get high relationship values, based on the fact that the event was well documented (rather than that the terms were closely related). In order to reduce the effect of this potential flaw, the method according to this embodiment of the invention uses a procedure based on document clustering. The choice of clustering algorithm may vary (one example is the well-known K-means clustering based on the Document-Term vectors). Nevertheless, a set of document clusters containing similar documents will be produced.

Specifically, the filtering step 920 includes the following sub-steps. A first sub-step 920 a identifies a number of document clusters C₁, . . . , C_(n) in the corpus by using a document clustering algorithm. For each cluster found, sub-steps 920 b and 920 c generate a representative document vector by means of the clustering algorithm, for instance by calculating the cluster centroid as the mean of all document vectors in the cluster. The sub-step 920 c also adds the representative document vector to the cluster. A sub-step 920 d removes all other documents (non-clustered documents) that belong to the cluster from the initial document corpus. The procedure loops through the sub-steps 920 b through 920 d via a return counter 920 e until all the document clusters C₁, . . . , C_(n) have been processed. Finally, the step 930 produces a new Document Corpus where each cluster is represented by a cluster representative vector, which reduces the above-mentioned biasing risk.
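The sub-steps 920 b to 920 d can be sketched as below, assuming that the cluster assignment of the sub-step 920 a is already available as a list of labels produced by some clustering algorithm. The function name `collapse_clusters` and the sample vectors are hypothetical, and the clustering itself is not shown.

```python
def collapse_clusters(doc_vectors, cluster_labels):
    """Replace every cluster of document vectors by its centroid (mean vector)."""
    clusters = {}
    for vector, label in zip(doc_vectors, cluster_labels):
        clusters.setdefault(label, []).append(vector)
    new_corpus = []
    for label in sorted(clusters):
        members = clusters[label]
        # Component-wise mean of all document vectors in the cluster.
        centroid = [sum(column) / len(members) for column in zip(*members)]
        new_corpus.append(centroid)
    return new_corpus


# Hypothetical corpus: two near-duplicate documents in cluster 0, one document in cluster 1.
vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
print(collapse_clusters(vectors, labels))
```

The new corpus holds one representative vector per cluster, so that a well-documented event counts once rather than once per document.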

FIG. 9 b illustrates another method for choosing the document corpus used to generate the term-term matrix. An initial step 940 identifies a Document Corpus. A subsequent step 950 allows a user to input one or several terms and/or one or several concepts. Then, based on the Document Corpus and the user input, a step 960 selects those documents included in the Document Corpus that are related to the data specified in the user input. Finally, a step 970 produces a new Document Corpus exclusively including the documents selected in the step 960. This enables retrieval of relationships within a certain area of interest (for example, people being related to “Bill Clinton”, based on documents containing “UN”).

FIG. 10 illustrates, by means of a flow diagram, the operation of the exploring module (152 in FIG. 1) according to an embodiment of the invention. The exploring module is used to provide services based on relationships in the document corpus. Based on one or a plurality of terms, the module then presents relevant documents, related terms and a conceptual distribution.

A joint concept engine (JCE) 1020 is here used to determine the concepts being common to at least two terms 1010. The terms 1010 are input to the TCM 350 and the concept distribution for each term (corresponding to the respective term's row in the TCM) is sent as input to the JCE 1020. The JCE 1020 calculates a joint concept distribution by selecting the lowest component values from all the terms' concept vectors, which are given by the TCM 350. A new vector is created based on these component values. The vector is subsequently normalized and returned as the result from the JCE 1020. The result from the JCE 1020 may be regarded as an explanation of the conceptual relationship between two or more terms. For example, a user asking for the joint concepts pertaining to the terms “Madeleine Albright” and “Tony Blair” may be presented with a pie chart covering the concepts “Politics” and “Balkan War”.
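The joint concept calculation can be sketched as follows. The function name `joint_concepts`, the concept ordering and the sample rows are assumptions, and the Euclidean norm is assumed for the normalization since the text does not name the norm used by the JCE.

```python
import math


def joint_concepts(concept_vectors):
    """Component-wise minimum over the terms' concept vectors, followed by normalization."""
    joint = [min(components) for components in zip(*concept_vectors)]
    norm = math.sqrt(sum(value * value for value in joint))
    return [value / norm for value in joint] if norm > 0 else joint


# Hypothetical TCM rows; the concept columns are ("Politics", "Balkan War", "Economy").
madeleine_albright = [0.6, 0.8, 0.0]
tony_blair = [0.8, 0.36, 0.48]
print(joint_concepts([madeleine_albright, tony_blair]))
```

A concept that is absent for any one of the terms, such as the third column here, drops out of the joint distribution.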

A concept bias engine (CBE) 1040 is used to retrieve a set of relevant documents, given at least one term, which not only relate to the given term(s), but also relate to at least one concept. The latter may be supplied directly from a user, from a subsystem or a search engine in a step 1035. For example, the at least one concept may be selected from all concepts occurring in the term's conceptual distribution, such that information will be retrieved that is related to the term in a specific way.

If no concept is used as input to the CBE 1040 via the step 1035, the result will be a set of documents 1045 being related to the given term(s) 1010 without any bias. However, if a concept distribution is input to the CBE 1040 in the step 1035 this will “bias” the set of documents 1045, or re-arrange this set, based on the documents' 1045 proximity to the given distribution. Specifically, the biasing is produced on basis of the documents' conceptual representation as given by the DCM 390.

Returning to the example stated above, a further example is here presented in order to illustrate the use of the CBE 1040. A user who selects the term “Madeleine Albright” would initially be presented with related terms, related documents, and, say, a pie chart including the concepts “Politics”, “Balkan War” and “America”. If the user subsequently selects the concept “Balkan War”, the CBE 1040 will present documents that not only relate to “Madeleine Albright”, but specifically concern the “Balkan War”. Thus, the user is guided into finding specific subsets of the document corpus that may be of particular interest to him/her.

FIG. 11 illustrates, by means of a flow diagram, a method for finding biased information according to an embodiment of the invention. A set of selected terms T₁, . . . , T_(n) is entered in a first step 1110. Then, based thereon, a step 1115 generates a Document Corpus, for instance according to the method described above with reference to FIG. 9 a or 9 b. A following step 1120 uses the TDM to find documents D_(i) that contain the terms T₁, . . . , T_(n). Given the documents' D_(i) conceptual distributions C_(i), as indicated by the DCM in a step 1130, and an input bias conceptual distribution B_(CD) received in a step 1140 via a step 1150, a step 1160 calculates a relationship value rcc(C_(i), B_(CD)) for each document D_(i) according to:

$rcc(C_i,B_{CD}) = \sum\limits_{k=1}^{n} C_{i,k} B_{CD,k},$

where C_(i,k) is a weight for a concept k in the distribution C_(i) and B_(CD,k) is a weight for the concept k in the distribution B_(CD). The sum is calculated over every concept. If the concept distributions C_(i) are represented as vectors, the rcc-function is equivalent to the so-called dot product. Finally, resulting documents are returned in a step 1170. These documents are ranked in descending order by the value of the rcc-function.
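The ranking of the steps 1160 and 1170 can be sketched with a plain dot product. The function name `rank_by_bias`, the document identifiers and the concept weights are assumptions made for the example.

```python
def rank_by_bias(doc_concepts, bias):
    """Rank documents by the dot product rcc between their concept vector and the bias."""
    scores = {
        doc: sum(c * b for c, b in zip(vector, bias))
        for doc, vector in doc_concepts.items()
    }
    # Descending order of rcc-value, as in the step 1170.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical DCM rows; the concept columns are ("Politics", "Balkan War").
doc_concepts = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.6, 0.8]}
bias = [0.0, 1.0]  # the user selected "Balkan War" only
print(rank_by_bias(doc_concepts, bias))
```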

Please observe the loop from the step 1110, via the step 1150 to the step 1140. According to a preferred embodiment of the invention, based on the term input and the JCE (1020 in FIG. 10), a number of concepts suitable for biasing are presented to the user.

Returning now to FIG. 10, the purpose of the path engine 1060 is to describe relationships between terms by using the term-term matrix 370 plus at least one term as the input. The path engine 1060 has two modes of operation, Single Term Mode (STM) and Multiple Terms Mode (MTM).

In STM, one and only one term is supplied as input. The primary purpose of STM is to find the most relevant terms for a specific term. For example, if “Yasser Arafat” were used as input, the path engine 1060 would typically reply “Israel”, “Benjamin Netanyahu” and “Bill Clinton” as well as corresponding relevance measures for each term. The path engine 1060 uses the term-term matrix 370 as a graph matrix, and traverses this graph to find any terms being related to the input. All terms within a certain distance in the graph are then returned as a result from the engine 1060. The distance measure may differ depending on implementation; however, reasonable measures are either the number of graph nodes from input or the accumulated edge weights in the graph.
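A minimal sketch of STM is given below, using the number of graph nodes from the input as the distance measure and a breadth-first traversal. The function name `related_terms`, the sparse dictionary layout of the term-term matrix, the `min_strength` cut-off and the sample weights are all assumptions of this sketch.

```python
from collections import deque


def related_terms(ttm, start, max_hops, min_strength=0.0):
    """Return {term: hop distance} for all terms within max_hops of the start term.

    ttm: {term: {related term: relationship value}}, a sparse term-term matrix.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        if distance[term] == max_hops:
            continue  # do not expand beyond the allowed distance
        for other, strength in ttm.get(term, {}).items():
            if strength > min_strength and other not in distance:
                distance[other] = distance[term] + 1
                queue.append(other)
    del distance[start]
    return distance


# Hypothetical relationship values.
ttm = {
    "Yasser Arafat": {"Israel": 0.9, "Benjamin Netanyahu": 0.7},
    "Israel": {"Bill Clinton": 0.5},
    "Bill Clinton": {"Al Gore": 0.8},
}
print(related_terms(ttm, "Yasser Arafat", max_hops=2))
```

With `max_hops=2`, the term “Al Gore” at three hops from the input is left out of the result.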

In MTM, a plurality of terms are instead supplied as input. The path engine 1060 again uses the term-term matrix 370 as a graph matrix, and uses well-known graph algorithms to calculate and return a sub-graph of this graph. As in STM, the algorithms apply a distance measure that depends on the specific implementation. The same distance measures as above may be applied. The choice of graph algorithm determines the use of the sub-graph. For instance, Dijkstra's Shortest Path algorithm provides the shortest path between two terms in the graph. Floyd-Warshall's algorithm provides the shortest paths between all supplied terms. The so-called MST provides the minimal spanning tree spanning all supplied terms. The purpose of the various sub-graphs is to examine the relationship between a plurality of terms, and to allow the relationship to be graphically visualized to enable users to further explore the information in the system.
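As one possible illustration of MTM, the Floyd-Warshall algorithm can be applied with hop counts as the distance measure. The sketch treats the relationship graph as undirected, which is a simplifying assumption since the term-term matrix is generally non-symmetric. The function name `all_pairs_hops` and the sample edges are hypothetical.

```python
def all_pairs_hops(nodes, edges):
    """Floyd-Warshall over an unweighted, undirected graph; returns hop distances."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = 1
        dist[v][u] = 1  # assumption: relationships are treated as symmetric here
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical relationship network.
nodes = ["Carl Bildt", "Tony Blair", "Gerhard Schröder", "Bill Clinton", "Hillary Clinton"]
edges = [
    ("Carl Bildt", "Tony Blair"),
    ("Tony Blair", "Gerhard Schröder"),
    ("Carl Bildt", "Bill Clinton"),
    ("Bill Clinton", "Gerhard Schröder"),
    ("Bill Clinton", "Hillary Clinton"),
]
hops = all_pairs_hops(nodes, edges)
print(hops["Carl Bildt"]["Gerhard Schröder"], hops["Carl Bildt"]["Hillary Clinton"])
```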

An example of the use of MTM is shown in FIG. 12. The figure shows a term-term matrix being displayed as a relationship network. Here, a first term 431 (“Carl Bildt”), a second term 433 (“Gerhard Schröder”) and a third term 434 (“Hillary Clinton”) are presumed to be used as input to a path engine 1060 running in MTM, with Floyd-Warshall as the chosen algorithm and the number of graph nodes from input as the distance measure. The path engine 1060 calculates the shortest distance between all three terms 431, 433 and 434 in the graph. These paths are displayed as dashed lines in the figure.

As is apparent from the figure, there are three equidistant relationship paths between the first term 431 (“Carl Bildt”) and the second term 433 (“Gerhard Schröder”). These paths run via a fourth term 432 (“Tony Blair”), a fifth term 435 (“Kofi Annan”) and a sixth term 436 (“Bill Clinton”) respectively.

Furthermore, the shortest possible path from the first term 431 (“Carl Bildt”) and the second term 433 (“Gerhard Schröder”) to a seventh term 434 (“Hillary Clinton”) runs via the sixth term 436 (“Bill Clinton”). The merit of the MTM is that it reveals implicit relations between terms, such as “proper names”. Moreover, the relationships may easily be explained and displayed graphically to a user, thus allowing him/her to further explore the information in search of relevant facts.

In order to sum up, the general method for processing digitized textual information according to the invention will now be described with reference to FIG. 13. The information is presumed to be organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document.

A first step 1310 generates a concept vector for each document in a document corpus. The concept vector conceptually classifies the contents of the document in a relatively compact format. A following step 1320 generates, for each term in the document corpus, a term-to-concept vector which describes a relationship between the term and each of the concept vectors. Subsequently, a step 1330 generates a term-term matrix, which describes a term-to-term relationship between the terms in the document corpus. The term-term matrix is produced on basis of the term-to-concept vectors for the document corpus. Finally, a step 1340 processes the term-term matrix into processed textual information, which preferably has a graphical format that is well adapted to be comprehended by a human user.

FIG. 14 shows a flow diagram, which summarizes a sub-procedure for generating a term-to-concept vector according to a preferred embodiment of the invention. Each document in the document corpus is here presumed to be associated with a document-concept matrix, which represents at least one concept element whose relevance with respect to the document is described by a weight factor.

A first step 1410 identifies a term-relevant set of documents in the document corpus. Each document in the term-relevant set contains at least one occurrence of the term. Then, a step 1420 calculates a term weight for the term in each of the documents in the term-relevant set. A step 1430 thereafter retrieves a respective concept vector being associated with each document in the term-relevant set. However, a condition for including a specific concept vector is that the term weight therein exceeds a first threshold value. Subsequently, a step 1440 selects a relevant set of concept vectors including any concept vector in which at least one concept component exceeds a second threshold value. A step 1450 then calculates an initial non-normalized term-to-concept vector as the sum of all concept vectors in the relevant set. Finally, a subsequent step normalizes the initial term-to-concept vector that was obtained in the step 1450. Preferably, the normalizing is carried out according to the Euclidean norm.

FIG. 15 shows a flow diagram, which summarizes a sub-procedure for generating the term-term matrix according to a preferred embodiment of the invention. A first step 1510 retrieves a respective term-to-concept vector for each term in each combination of two unique terms in the document corpus. Then, a step 1520 generates a relation vector, which describes the relationship between the terms in each combination of two unique terms. Each component in the relation vector is here equal to a lowest component value of corresponding component values in the term-to-concept vectors. A subsequent step 1530 generates a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector. Finally, a step 1540 generates a matrix, which contains the relationship values of each combination of two unique terms in the document corpus.

All of the process steps, as well as any sub-sequence of steps, described with reference to FIGS. 13-15 above may be controlled by means of a computer program being directly loadable into the internal memory of a computer, which includes appropriate software for controlling the necessary steps when the program is run on a computer. Naturally, the same is also true with respect to the procedures described with reference to FIGS. 2-12. Furthermore, such computer programs can be recorded onto an arbitrary kind of computer readable medium as well as be transmitted over an arbitrary type of network and transmission medium.

The term “comprises/comprising” when used in this specification is taken to specify the presence of stated features, integers, steps or components. However, the term does not preclude the presence or addition of one or more additional features, integers, steps or components or groups thereof.

The invention is not restricted to the described embodiments in the figures, but may be varied freely within the scope of the claims.

1. A method of processing digitized textual information in a computerized database system, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising: generating, by using a computer, a concept vector for each document in a document corpus, wherein the concept vector conceptually classifies contents of the document in a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the term-to-concept vectors are generated on the basis of the concept vectors, receiving the term-to-concept vectors for the document corpus and on the basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
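For illustration only, the term-term matrix generation recited in claim 1 might be sketched as follows in Python. The function names, the dictionary-based matrix layout and the sample vectors are assumptions made for this sketch and are not taken from the specification.

```python
from itertools import combinations


def relationship_value(vec_a, vec_b):
    """Sum of the relation vector, whose components are the lowest of the
    corresponding components of the two term-to-concept vectors."""
    relation_vector = [min(a, b) for a, b in zip(vec_a, vec_b)]
    return sum(relation_vector)


def term_term_matrix(term_to_concept):
    """Build a symmetric nested-dict matrix over all pairs of unique terms.

    term_to_concept: dict mapping a term to its term-to-concept vector
    (hypothetical layout, chosen for readability).
    """
    terms = sorted(term_to_concept)
    matrix = {term: {} for term in terms}
    for a, b in combinations(terms, 2):
        value = relationship_value(term_to_concept[a], term_to_concept[b])
        matrix[a][b] = value
        matrix[b][a] = value
    return matrix


# Invented sample data: three terms described over three concepts.
example = {
    "car": [0.5, 0.5, 0.0],
    "engine": [0.25, 0.75, 0.0],
    "flower": [0.0, 0.0, 1.0],
}
matrix = term_term_matrix(example)
print(matrix["car"]["engine"])  # 0.75
print(matrix["car"]["flower"])  # 0.0
```

In this sketch two terms that share no concept receive a relationship value of zero, while terms whose concept weights overlap receive a value that grows with the overlap.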
2. The method according to claim 1, wherein each document in the document corpus is associated with a document-concept matrix representing at least one concept element whose relevance with respect to the document is described by a weight factor, and the generation of each term-to-concept vector comprises: identifying a term-relevant set of documents in the document corpus, each document in the term-relevant set containing at least one occurrence of the term, calculating a term weight for the term in each of the documents in the term-relevant set, retrieving a respective concept vector associated with each document in the term-relevant set where the term weight exceeds a first threshold value, selecting a relevant set of concept vectors including any concept vector in which at least one concept component exceeds a second threshold value, calculating a non-normalized term-to-concept vector as the sum of all concept vectors in the relevant set, and normalizing the non-normalized term-to-concept vector.
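The steps of claim 2 might be sketched as below. The claim does not fix how the term weight is calculated or how the vector is normalized; this sketch assumes a relative term frequency as the weight and a normalization that makes the components sum to one. All names and sample data are hypothetical.

```python
def term_to_concept_vector(term, documents, concept_vectors,
                           first_threshold, second_threshold):
    """Return a normalized term-to-concept vector, or None if no
    concept vector qualifies.

    documents: dict doc_id -> list of terms (hypothetical layout).
    concept_vectors: dict doc_id -> concept vector of that document.
    """
    relevant = []
    for doc_id, tokens in documents.items():
        if term not in tokens:
            continue  # document is not in the term-relevant set
        # Assumed term weight: relative frequency of the term in the document.
        weight = tokens.count(term) / len(tokens)
        if weight <= first_threshold:
            continue
        vector = concept_vectors[doc_id]
        # Keep vectors with at least one component above the second threshold.
        if any(component > second_threshold for component in vector):
            relevant.append(vector)
    if not relevant:
        return None
    summed = [sum(components) for components in zip(*relevant)]
    total = sum(summed)
    # Assumed normalization: scale so that the components sum to one.
    return [c / total for c in summed] if total else summed


# Invented sample corpus with two concepts.
documents = {
    "d1": ["car", "engine", "car", "road"],
    "d2": ["car", "flower", "garden", "bee", "sun"],
    "d3": ["flower", "garden"],
}
concept_vectors = {"d1": [0.75, 0.25], "d2": [0.25, 0.75], "d3": [0.0, 1.0]}
print(term_to_concept_vector("car", documents, concept_vectors, 0.1, 0.5))
# [0.5, 0.5]
```

Raising the first threshold to 0.25 in this example excludes document d2, in which "car" is a minor term, so the result follows d1 alone.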
3. The method according to claim 1, wherein the method further comprises the steps of: calculating a statistical co-occurrence value for each combination of two unique terms in the document corpus, the statistical co-occurrence value describing a dependent probability that a certain second term exists in a document provided that a certain first term exists in the document, and incorporating the statistical co-occurrence values into the term-term matrix to represent lexical relationships between the terms in the document corpus.
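The statistical co-occurrence value of claim 3 can be read as a conditional probability estimated from document counts. The following is a minimal sketch under that reading; the function name and the representation of documents as sets of terms are assumptions, and the way the values are incorporated into the term-term matrix is deliberately left out because the claim does not specify it.

```python
def cooccurrence(documents, first, second):
    """Estimate the probability that `second` occurs in a document,
    given that `first` occurs in it.

    documents: iterable of sets of terms (hypothetical layout).
    """
    with_first = [doc for doc in documents if first in doc]
    if not with_first:
        return 0.0  # the first term never occurs; no estimate possible
    with_both = sum(1 for doc in with_first if second in doc)
    return with_both / len(with_first)


# Invented sample corpus.
corpus = [
    {"car", "engine"},
    {"car", "road"},
    {"car", "engine", "oil"},
    {"car"},
]
print(cooccurrence(corpus, "car", "engine"))  # 0.5
print(cooccurrence(corpus, "engine", "car"))  # 1.0
```

The value is not symmetric: in the sample corpus every document containing "engine" also contains "car", but only half of the documents containing "car" contain "engine".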
4. The method according to claim 1, wherein the method further comprises the step of: displaying the processed textual information in a format adapted for human comprehension.
5. The method according to claim 4, wherein the displaying step further involves presentation of: at least one document identifier specifying a document being relevant with respect to at least one term in a query, at least one term being related to a term in the query, and a conceptual distribution representing a conceptual relationship between two or more terms in the document corpus, the conceptual distribution being based on shared concepts which are common to said terms.
6. The method according to claim 4, wherein the displaying step further involves presentation of at least one document identifier specifying a document being relevant with respect to at least one term in a query in combination with at least one user specified concept.
7. The method according to claim 5, wherein the method further comprises the step of: selecting the at least one user specified concept from the shared concepts in the conceptual distribution.
8. The method according to claim 4, wherein the method further comprises the step of: illustrating the conceptual relationship between the first term and the at least one second term by means of a respective relevance measure being associated with the at least one second term in respect of the first term.
9. The method according to claim 8, wherein the method further comprises the step of: displaying the processed textual information in a graphical format which visualizes the strength of the conceptual relationship between at least two terms.
10. The method according to claim 8, wherein the method further comprises the step of: displaying the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing the first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship, each connection is associated with an edge weight representing the strength of a conceptual relationship between the first term and a particular secondary term, and the relevance measure between the first term and the particular secondary term is represented by an accumulation of the edge weights being associated with the connections constituting a minimum number of node hops between the first term and the particular secondary term.
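The distance graph of claims 1 and 10 might be sketched as follows. The sketch assumes that the graph is derived from a term-term matrix by keeping only relationships of at least a given strength, and that, where several minimum-hop paths exist, the first one found by a breadth-first search is used; neither choice is prescribed by the claims. All names and values are hypothetical.

```python
from collections import deque


def build_graph(matrix, min_strength):
    """Connect two terms when their relationship value is at least
    min_strength; the relationship value becomes the edge weight."""
    graph = {term: {} for term in matrix}
    for a, row in matrix.items():
        for b, value in row.items():
            if a != b and value >= min_strength:
                graph[a][b] = value
    return graph


def relevance(graph, first, second):
    """Return (hops, accumulated_edge_weight) along a minimum-hop path
    found by breadth-first search, or None if the terms are unconnected."""
    queue = deque([(first, 0, 0.0)])
    seen = {first}
    while queue:
        node, hops, total = queue.popleft()
        if node == second:
            return hops, total
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1, total + weight))
    return None


# Invented symmetric term-term matrix.
matrix = {
    "car": {"engine": 0.75, "oil": 0.25, "flower": 0.0},
    "engine": {"car": 0.75, "oil": 0.5, "flower": 0.0},
    "oil": {"car": 0.25, "engine": 0.5, "flower": 0.0},
    "flower": {"car": 0.0, "engine": 0.0, "oil": 0.0},
}
graph = build_graph(matrix, min_strength=0.5)
print(relevance(graph, "car", "oil"))     # (2, 1.25)
print(relevance(graph, "car", "flower"))  # None
```

With the strength limit of 0.5 the weak direct relationship between "car" and "oil" is not drawn, so "oil" is reached in two hops via "engine". The hop count corresponds to the relevance measure of claim 1 and the accumulated weight to that of claim 10.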
11. The method according to claim 1, wherein each term further comprises: a single word, a proper name, a phrase, and a compound of single words.
12. The method according to claim 1, further comprising the step of updating the document corpus with added data in the form of at least one new document by means of: identifying any added terms in the new document which lack a representation in the document corpus, identifying any existing terms in the new document which were represented in the document corpus before adding the at least one new document, retrieving, for each of the existing terms, a corresponding concept vector, generating a new concept vector with respect to the at least one new document as a sum of the corresponding concept vectors, normalizing the new concept vector into a normalized new concept vector, and assigning the normalized new concept vector to each of the added terms in the new document.
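The update procedure of claim 12 might be sketched as below. The sketch assumes that the "corresponding concept vector" of an existing term is the vector already stored for that term, and that normalization scales the components to sum to one; both are interpretations, and the names and data are hypothetical.

```python
def add_document(term_vectors, new_document_terms):
    """Assign a vector to every term of the new document that is not yet
    represented, and return the newly assigned vectors.

    term_vectors: dict term -> stored vector (modified in place).
    """
    unique_terms = sorted(set(new_document_terms))
    existing = [t for t in unique_terms if t in term_vectors]
    added = [t for t in unique_terms if t not in term_vectors]
    if not existing:
        return {}  # nothing to derive a new concept vector from
    # New concept vector: sum of the vectors of the existing terms.
    summed = [sum(cs) for cs in zip(*(term_vectors[t] for t in existing))]
    total = sum(summed)
    # Assumed normalization: scale so that the components sum to one.
    normalized = [c / total for c in summed] if total else summed
    for term in added:
        term_vectors[term] = list(normalized)
    return {term: term_vectors[term] for term in added}


# Invented sample data: "turbo" is new to the corpus.
vectors = {"car": [0.5, 0.5], "engine": [0.25, 0.75]}
print(add_document(vectors, ["car", "engine", "turbo"]))
# {'turbo': [0.375, 0.625]}
```

The new term thus inherits a conceptual profile from the known terms it appears together with, without the whole corpus being reprocessed.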
13. A computer-implemented search engine, embedded on a computer readable storage medium, for processing an amount of digitized textual information and extracting data therefrom, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, comprising: an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on the basis of the query and return processed textual information being relevant to the query, said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying contents of the document in a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises: a processing module configured to receive the term-to-concept vectors for the document corpus and on the basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on the basis of the query process the term-term matrix into the processed textual information, and a display module configured to display the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
14. A computer-implemented database system comprising: a processor; memory holding an amount of digitized textual information being organized in terms, documents and document corpora, wherein each document contains at least one term and each document corpus contains at least one document, wherein each document in a document corpus is associated with a concept vector which conceptually classifies contents of the document in a relatively compact format, and wherein each term in the document corpus is associated with a term-to-concept vector describing a relationship between the term and each of the concept vectors, delivering the term-to-concept vectors to a search engine for processing an amount of digitized textual information and extracting data therefrom, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, and computer program instructions implementing: an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on the basis of the query and return processed textual information being relevant to the query, said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying the contents of the document in a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises: a processing module configured to receive the term-to-concept vectors for the document corpus and on the basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on the basis of the query process the term-term matrix into the processed textual information, and a display module configured to display the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
15. The computer-implemented database system according to claim 14, further comprising an iterative term-to-concept engine configured to receive fresh digitized textual information added to the database and on the basis of this information generate concept vectors for any added document, and generate a term-to-concept vector describing a relationship between any added term and each of the concept vectors.
16. A server computer system for providing data processing services in respect of digitized textual information, wherein the server comprises: a processor; memory for storing computer program instructions and data; and computer program instructions stored in the memory for implementing: a search engine for processing an amount of digitized textual information and extracting data therefrom, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, comprising an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on the basis of the query and return processed textual information being relevant to the query, said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying contents of the document in a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises a processing module configured to receive the term-to-concept vectors for the document corpus and on the basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on the basis of the query process the term-term matrix into the processed textual information, a display module configured to display the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term, and a communication interface towards a database system holding an amount of digitized textual information and configured to deliver the term-to-concept vectors to the search engine.
17. A computer system comprising: a processor for executing computer program instructions, a memory for storing computer program instructions, and computer program instructions comprising software for processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the digitized textual information processed by: generating a concept vector for each document in a document corpus, wherein the concept vector conceptually classifies the contents of the document in a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the term-to-concept vectors are generated on the basis of the concept vectors, receiving the term-to-concept vectors for the document corpus and on the basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
18. A computer program product stored in a computer readable storage medium, the computer program product comprising: computer program instructions recorded thereon for causing a computer to process digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the digitized textual information processed by: generating a concept vector for each document in a document corpus, wherein the concept vector conceptually classifies contents of the document in a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the term-to-concept vectors are generated on the basis of the concept vectors, receiving the term-to-concept vectors for the document corpus and on the basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises: retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node, wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.